Abstract

Lasso, or $\ell_1$ regularized least squares, has been explored extensively for its remarkable sparsity
properties. It is shown in this paper that the solution to Lasso, in addition to its sparsity, has robustness 
properties: it is the solution to a robust optimization problem. This has two important consequences. 
First, robustness provides a connection of the regularizer to a physical property, namely, protection from 
noise. This allows a principled selection of the regularizer, and in particular, generalizations of Lasso 
that also yield convex optimization problems are obtained by considering different uncertainty sets. 

Secondly, robustness can itself be used as an avenue to exploring different properties of the solution. 
In particular, it is shown that robustness of the solution explains why the solution is sparse. The analysis 
as well as the specific results obtained differ from standard sparsity results, providing different geometric 
intuition. Furthermore, it is shown that the robust optimization formulation is related to kernel density 
estimation, and based on this approach, a proof that Lasso is consistent is given using robustness directly. 
Finally, a theorem saying that sparsity and algorithmic stability contradict each other, and hence Lasso 
is not stable, is presented. 

Index Terms 

Statistical Learning, Regression, Regularization, Kernel density estimator, Lasso, Robustness,
Sparsity, Stability.

I. Introduction 

In this paper we consider linear regression problems with least-square error. The problem is 
to find a vector $x$ so that the $\ell_2$ norm of the residual $b - Ax$ is minimized, for a given matrix
$A \in \mathbb{R}^{n \times m}$ and vector $b \in \mathbb{R}^n$. From a learning/regression perspective, each row of $A$ can be
regarded as a training sample, and the corresponding element of b as the target value of this 
observed sample. Each column of A corresponds to a feature, and the objective is to find a set 
of weights so that the weighted sum of the feature values approximates the target value. 

It is well known that minimizing the least squared error can lead to sensitive solutions [1]-[4].
Many regularization methods have been proposed to decrease this sensitivity. Among them,
Tikhonov regularization [5] and Lasso [6], [7] are two widely known and cited algorithms. These
methods minimize a weighted sum of the residual norm and a certain regularization term, $\|x\|_2$
for Tikhonov regularization and $\|x\|_1$ for Lasso. In addition to providing regularity, Lasso is also
known for the tendency to select sparse solutions. Recently this has attracted much attention for 
its ability to reconstruct sparse solutions when sampling occurs far below the Nyquist rate, and 
also for its ability to recover the sparsity pattern exactly with probability one, asymptotically
as the number of observations increases (there is an extensive literature on this subject, and we
refer the reader to [8]-[12] and references therein).

The first result of this paper is that the solution to Lasso has robustness properties: it is the 
solution to a robust optimization problem. In itself, this interpretation of Lasso as the solution 
to a robust least squares problem is a development in line with the results of [13]. There, the 
authors propose an alternative approach of reducing sensitivity of linear regression by considering 
a robust version of the regression problem, i.e., minimizing the worst-case residual for the 
observations under some unknown but bounded disturbance. Most of the research in this area 
considers either the case where the disturbance is row-wise uncoupled [14], or the case where 
the Frobenius norm of the disturbance matrix is bounded [13]. 

None of these robust optimization approaches produces a solution that has sparsity properties 
(in particular, the solution to Lasso does not solve any of these previously formulated robust 
optimization problems). In contrast, we investigate the robust regression problem where the 
uncertainty set is defined by feature-wise constraints. Such a noise model is of interest when 
values of features are obtained with some noisy pre-processing steps, and the magnitudes of such 
noises are known or bounded. Another situation of interest is where features are meaningfully 
coupled. We define coupled and uncoupled disturbances and uncertainty sets precisely in Section
II-A below. Intuitively, a disturbance is feature-wise coupled if the variation or disturbance across
features satisfies joint constraints, and uncoupled otherwise.

Considering the solution to Lasso as the solution of a robust least squares problem has two 
important consequences. First, robustness provides a connection of the regularizer to a physical 
property, namely, protection from noise. This allows more principled selection of the regularizer, 
and in particular, considering different uncertainty sets, we construct generalizations of Lasso 
that also yield convex optimization problems. 

Secondly, and perhaps most significantly, robustness is a strong property that can itself be used 
as an avenue to investigating different properties of the solution. We show that robustness of the 
solution can explain why the solution is sparse. The analysis as well as the specific results we 
obtain differ from standard sparsity results, providing different geometric intuition, and extending 
beyond the least-squares setting. Sparsity results obtained for Lasso ultimately depend on the 
fact that introducing additional features incurs larger $\ell_1$-penalty than the least squares error
reduction. In contrast, we exploit the fact that a robust solution is, by definition, the optimal 
solution under a worst-case perturbation. Our results show that, essentially, a coefficient of the 
solution is nonzero if the corresponding feature is relevant under all allowable perturbations. In 
addition to sparsity, we also use robustness directly to prove consistency of Lasso. 

We briefly list the main contributions as well as the organization of this paper. 

• In Section II, we formulate the robust regression problem with feature-wise independent
disturbances, and show that this formulation is equivalent to a least-square problem with a
weighted $\ell_1$ norm regularization term. Hence, we provide an interpretation of Lasso from
a robustness perspective.

• We generalize the robust regression formulation to loss functions of arbitrary norm in
Section III. We also consider uncertainty sets that require disturbances of different features
to satisfy joint conditions. This can be used to mitigate the conservativeness of the robust
solution and to obtain solutions with additional properties.

• In Section IV, we present new sparsity results for the robust regression problem with
feature-wise independent disturbances. This provides a new robustness-based explanation
of the sparsity of Lasso. Our approach gives new analysis and also geometric intuition, and
furthermore allows one to obtain sparsity results for more general loss functions, beyond
the squared loss.

• Next, we relate Lasso to kernel density estimation in Section V. This allows us to re-prove
consistency in a statistical learning setup, using the new robustness tools and formulation 
we introduce. Along with our results on sparsity, this illustrates the power of robustness in 
explaining and also exploring different properties of the solution. 

• Finally, we prove in Section VI a "no-free-lunch" theorem, stating that an algorithm that
encourages sparsity cannot be stable. 

Notation. We use capital letters to represent matrices, and boldface letters to represent column
vectors. Row vectors are represented as the transpose of column vectors. For a vector $z$, $z_i$
denotes its $i$th element. Throughout the paper, $a_i$ and $r_j^\top$ are used to denote the $i$th column and
the $j$th row of the observation matrix $A$, respectively. We use $a_{ij}$ to denote the $ij$ element of
$A$, hence it is the $j$th element of $r_i$, and the $i$th element of $a_j$. For a convex function $f(\cdot)$, $\partial f(z)$
represents any of its sub-gradients evaluated at $z$. A vector of length $n$ with each element
equal to 1 is denoted as $\mathbf{1}_n$.



II. Robust Regression with Feature-wise Disturbance 

In this section, we show that our robust regression formulation recovers Lasso as a special 
case. We also derive probabilistic bounds that guide the construction of the uncertainty set.

The regression formulation we consider differs from the standard Lasso formulation, as we 
minimize the norm of the error, rather than the squared norm. It is known that these two coincide 
up to a change of the regularization coefficient. Yet as we discuss above, our results lead to more 
flexible and potentially powerful robust formulations, and give new insight into known results. 



A. Formulation 

Robust linear regression considers the case where the observed matrix is corrupted by some 
potentially malicious disturbance. The objective is to find the optimal solution in the worst case 
sense. This is usually formulated as the following min-max problem, 

Robust Linear Regression:

$$\min_{x \in \mathbb{R}^m} \Big\{ \max_{\Delta A \in \mathcal{U}} \|b - (A + \Delta A)x\|_2 \Big\}, \tag{1}$$

where $\mathcal{U}$ is called the uncertainty set, or the set of admissible disturbances of the matrix $A$. In
this section, we consider the class of uncertainty sets that bound the norm of the disturbance
to each feature, without placing any joint requirements across feature disturbances. That is, we
consider the class of uncertainty sets:

$$\mathcal{U} \triangleq \Big\{ (\delta_1, \cdots, \delta_m) \,\Big|\, \|\delta_i\|_2 \le c_i, \ i = 1, \cdots, m \Big\}, \tag{2}$$

for given $c_i \ge 0$. We call these uncertainty sets feature-wise uncoupled, in contrast to coupled
uncertainty sets that require disturbances of different features to satisfy some joint constraints (we
discuss these, and their significance, extensively below). While the inner maximization problem
of (1) is nonconvex, we show in the next theorem that uncoupled norm-bounded uncertainty sets
lead to an easily solvable optimization problem.
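
Theorem 1: The robust regression problem (1) with uncertainty set of the form (2) is equivalent
to the following $\ell_1$ regularized regression problem:

$$\min_{x \in \mathbb{R}^m} \Big\{ \|b - Ax\|_2 + \sum_{i=1}^m c_i |x_i| \Big\}. \tag{3}$$

Proof: Fix $x^*$. We prove that
$\max_{\Delta A \in \mathcal{U}} \|b - (A + \Delta A)x^*\|_2 = \|b - Ax^*\|_2 + \sum_{i=1}^m c_i |x_i^*|$.

The left hand side can be written as

$$\begin{aligned}
\max_{\Delta A \in \mathcal{U}} \|b - (A + \Delta A)x^*\|_2
&= \max_{(\delta_1, \cdots, \delta_m) \,|\, \|\delta_i\|_2 \le c_i} \big\| b - \big(A + (\delta_1, \cdots, \delta_m)\big) x^* \big\|_2 \\
&= \max_{(\delta_1, \cdots, \delta_m) \,|\, \|\delta_i\|_2 \le c_i} \Big\| b - Ax^* - \sum_{i=1}^m x_i^* \delta_i \Big\|_2 \\
&\le \max_{(\delta_1, \cdots, \delta_m) \,|\, \|\delta_i\|_2 \le c_i} \Big\{ \|b - Ax^*\|_2 + \sum_{i=1}^m \|x_i^* \delta_i\|_2 \Big\} \\
&\le \|b - Ax^*\|_2 + \sum_{i=1}^m |x_i^*| c_i.
\end{aligned} \tag{4}$$

Now, let

$$u \triangleq \begin{cases} \dfrac{b - Ax^*}{\|b - Ax^*\|_2} & \text{if } Ax^* \neq b, \\ \text{any vector with unit } \ell_2 \text{ norm} & \text{otherwise;} \end{cases}$$

and let

$$\delta_i^* \triangleq -c_i \, \mathrm{sgn}(x_i^*) \, u.$$

Observe that $\|\delta_i^*\|_2 \le c_i$, hence $\Delta A^* \triangleq (\delta_1^*, \cdots, \delta_m^*) \in \mathcal{U}$. Notice that

$$\begin{aligned}
\max_{\Delta A \in \mathcal{U}} \|b - (A + \Delta A)x^*\|_2
&\ge \|b - (A + \Delta A^*)x^*\|_2 \\
&= \big\| b - \big(A + (\delta_1^*, \cdots, \delta_m^*)\big) x^* \big\|_2 \\
&= \Big\| (b - Ax^*) - \sum_{i=1}^m \big( -x_i^* c_i \, \mathrm{sgn}(x_i^*) \, u \big) \Big\|_2 \\
&= \Big\| (b - Ax^*) + \Big( \sum_{i=1}^m c_i |x_i^*| \Big) u \Big\|_2 \\
&= \|b - Ax^*\|_2 + \sum_{i=1}^m c_i |x_i^*|.
\end{aligned} \tag{5}$$

The last equation holds from the definition of $u$.

Combining Inequalities (4) and (5) establishes the equality
$\max_{\Delta A \in \mathcal{U}} \|b - (A + \Delta A)x^*\|_2 = \|b - Ax^*\|_2 + \sum_{i=1}^m c_i |x_i^*|$
for any $x^*$. Minimizing over $x$ on both sides proves the theorem. ■

Taking $c_i = c$ and normalizing $a_i$ for all $i$, Problem (3) recovers the well-known Lasso [6], [7].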
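
The equality established in the proof of Theorem 1 can be checked numerically. The following is a minimal sketch in plain Python, not part of the original paper: the data `A`, `b`, `x`, `c` are arbitrary illustrative values, and all function names are hypothetical. It builds the adversarial perturbation $\delta_i^* = -c_i \, \mathrm{sgn}(x_i^*) \, u$ used in the proof, confirms that it attains $\|b - Ax\|_2 + \sum_i c_i |x_i|$, and confirms that randomly drawn admissible perturbations never exceed that value.

```python
import math
import random

def residual(A, b, x):
    """Residual vector b - A x for a row-major matrix A."""
    return [bk - sum(ak[i] * x[i] for i in range(len(x))) for ak, bk in zip(A, b)]

def norm2(v):
    return math.sqrt(sum(t * t for t in v))

def closed_form_worst_case(A, b, x, c):
    """Right-hand side of Theorem 1: ||b - A x||_2 + sum_i c_i |x_i|."""
    return norm2(residual(A, b, x)) + sum(ci * abs(xi) for ci, xi in zip(c, x))

def perturbed_residual_norm(A, b, x, delta):
    """||b - (A + Delta A) x||_2, where delta[i] perturbs column i of A."""
    r = residual(A, b, x)
    shift = [sum(x[i] * delta[i][k] for i in range(len(x))) for k in range(len(b))]
    return norm2([rk - sk for rk, sk in zip(r, shift)])

def adversarial_perturbation(A, b, x, c):
    """delta_i^* = -c_i sgn(x_i) u, with u the normalized residual (as in the proof)."""
    r = residual(A, b, x)
    nr = norm2(r)
    u = [t / nr for t in r] if nr > 0 else [1.0] + [0.0] * (len(b) - 1)
    sgn = [(xi > 0) - (xi < 0) for xi in x]
    return [[-c[i] * sgn[i] * uk for uk in u] for i in range(len(x))]

def random_admissible(n, c, rng):
    """A random perturbation with ||delta_i||_2 <= c_i for every column i."""
    delta = []
    for ci in c:
        g = [rng.gauss(0, 1) for _ in range(n)]
        scale = ci * rng.random() / norm2(g)
        delta.append([scale * t for t in g])
    return delta

rng = random.Random(0)
n, m = 5, 3
A = [[rng.gauss(0, 1) for _ in range(m)] for _ in range(n)]
b = [rng.gauss(0, 1) for _ in range(n)]
x = [0.7, -1.2, 0.0]
c = [0.3, 0.1, 0.5]

closed_form = closed_form_worst_case(A, b, x, c)
attained = perturbed_residual_norm(A, b, x, adversarial_perturbation(A, b, x, c))
sampled_max = max(perturbed_residual_norm(A, b, x, random_admissible(n, c, rng))
                  for _ in range(2000))
```

Here `attained` matches `closed_form` up to rounding, while `sampled_max` stays below it, which mirrors the two inequalities (4) and (5).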

B. Uncertainty Set Construction

The selection of an uncertainty set $\mathcal{U}$ in Robust Optimization is of fundamental importance.
One way this can be done is as an approximation of so-called chance constraints, where a 
deterministic constraint is replaced by the requirement that a constraint is satisfied with at least 
some probability. These can be formulated when we know the distribution exactly, or when we 
have only partial information of the uncertainty, such as, e.g., first and second moments. This 
chance-constraint formulation is particularly important when the distribution has large support, 
rendering the naive robust optimization formulation overly pessimistic. 

For confidence level $\eta$, the chance constraint formulation becomes:

minimize: $t$

subject to: $\Pr(\|b - (A + \Delta A)x\|_2 \le t) \ge 1 - \eta$.

Here, $x$ and $t$ are the decision variables.

Constructing the uncertainty set for feature i can be done quickly via line search and bisection, 
as long as we can evaluate $\Pr(\|a_i\|_2 \ge c)$. If we know the distribution exactly (i.e., if we have
complete probabilistic information), this can be quickly done via sampling. Another setting of 
interest is when we have access only to some moments of the distribution of the uncertainty, 
e.g., the mean and variance. In this setting, the uncertainty sets are constructed via a bisection 
procedure which evaluates the worst-case probability over all distributions with given mean and 
variance. We do this using a tight bound on the probability of an event, given the first two 
moments. 
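
The sampling-plus-bisection procedure described above can be sketched as follows. This is an illustrative sketch only, under stated assumptions: the Gaussian noise model, the sample size, and the names `tail_probability` and `radius_by_bisection` are hypothetical and do not come from the paper. The tail probability is estimated by Monte Carlo, and since it is non-increasing in $c$, bisection finds the smallest radius whose estimated tail probability is at most $\eta$.

```python
import math
import random

def tail_probability(samples, c):
    """Empirical estimate of Pr(||delta||_2 >= c) from sampled disturbance vectors."""
    hits = sum(1 for d in samples if math.sqrt(sum(v * v for v in d)) >= c)
    return hits / len(samples)

def radius_by_bisection(samples, eta, tol=1e-6):
    """Smallest c (up to tol) whose estimated tail probability is at most eta."""
    lo = 0.0
    # Beyond the largest sampled norm the estimated tail probability is zero.
    hi = max(math.sqrt(sum(v * v for v in d)) for d in samples) + 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tail_probability(samples, mid) <= eta:
            hi = mid  # mid is large enough; try a smaller radius
        else:
            lo = mid  # too much mass lies outside; enlarge the radius
    return hi

rng = random.Random(1)
# Assumed noise model (illustration only): i.i.d. N(0, 0.1^2) entries, 20 observations.
samples = [[rng.gauss(0, 0.1) for _ in range(20)] for _ in range(5000)]
eta = 0.05
c_i = radius_by_bisection(samples, eta)
```

The returned `c_i` can then serve as the bound on $\|\delta_i\|_2$ for feature $i$ in the uncoupled uncertainty set; repeating the procedure per feature gives one radius per column.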

In the scalar case, the Markov Inequality provides such a bound. The next theorem is a
generalization of the Markov inequality to $\mathbb{R}^n$, which bounds the probability that the disturbance on
a given feature is more than $c_i$, if only the first and second moments of the random variable are
known. We postpone the proof to the appendix, and refer the reader to [15] for similar results
using semi-definite optimization.

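Theorem 2: Consider a random vector $v \in \mathbb{R}^n$, such that $E(v) = a$, and $E(vv^\top) = \Sigma$, $\Sigma \succeq 0$.
Then we have

$$\Pr\{\|v\|_2 \ge c_i\} \le \left\{ \begin{array}{ll}
\min_{P, q, r, \lambda} & \mathrm{Trace}(\Sigma P) + 2 q^\top a + r \\
\text{subject to:} & \begin{pmatrix} P & q \\ q^\top & r \end{pmatrix} \succeq 0 \\
& \begin{pmatrix} I(m) & 0 \\ 0^\top & -c_i^2 \end{pmatrix} \preceq \lambda \begin{pmatrix} P & q \\ q^\top & r - 1 \end{pmatrix} \\
& \lambda \ge 0.
\end{array} \right. \tag{6}$$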



The optimization problem (6) is a semi-definite program, which is known to be solvable in
polynomial time. Furthermore, if we replace $E(vv^\top) = \Sigma$ by an inequality $E(vv^\top) \preceq \Sigma$, the
uniform bound still holds. Thus, even if our estimate of the variance is not precise, we are
still able to bound the probability of having "large" disturbance.

III. General Uncertainty Sets 

One reason the robust optimization formulation is powerful is that, having provided the
connection to Lasso, it then allows the opportunity to generalize to efficient "Lasso-like" regularization
algorithms.



November 11, 2008 



DRAFT 






In this section, we make several generalizations of the robust formulation (1) and derive
counterparts of Theorem 1. We generalize the robust formulation in two ways: (a) to the case
of arbitrary norm; and (b) to the case of coupled uncertainty sets.

We first consider the case of an arbitrary norm $\|\cdot\|_a$ of $\mathbb{R}^n$ as a cost function rather than the
squared loss. The proof of the next theorem is identical to that of Theorem 1, with only the $\ell_2$
norm changed to $\|\cdot\|_a$.

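Theorem 3: The robust regression problem

$$\min_{x \in \mathbb{R}^m} \Big\{ \max_{\Delta A \in \mathcal{U}_a} \|b - (A + \Delta A)x\|_a \Big\}; \quad \mathcal{U}_a \triangleq \Big\{ (\delta_1, \cdots, \delta_m) \,\Big|\, \|\delta_i\|_a \le c_i, \ i = 1, \cdots, m \Big\};$$

is equivalent to the following regularized regression problem

$$\min_{x \in \mathbb{R}^m} \Big\{ \|b - Ax\|_a + \sum_{i=1}^m c_i |x_i| \Big\}.$$
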

We next remove the assumption that the disturbances are feature-wise uncoupled. Allowing 
coupled uncertainty sets is useful when we have some additional information about potential 
noise in the problem, and we want to limit the conservativeness of the worst-case formulation. 
Consider the following uncertainty set: 

$$\mathcal{U}' \triangleq \Big\{ (\delta_1, \cdots, \delta_m) \,\Big|\, f_j(\|\delta_1\|_a, \cdots, \|\delta_m\|_a) \le 0; \ j = 1, \cdots, k \Big\},$$

where $f_j(\cdot)$ are convex functions. Notice that both $k$ and $f_j$ can be arbitrary, hence this is a
very general formulation, and provides us with significant flexibility in designing uncertainty sets
and equivalently new regression algorithms (see for example Corollaries 1 and 2). The following
theorem converts this formulation to tractable optimization problems. The proof is postponed to
the appendix.

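Theorem 4: Assume that the set

$$\mathcal{Z} \triangleq \{ z \in \mathbb{R}^m \mid f_j(z) \le 0, \ j = 1, \cdots, k; \ z \ge 0 \}$$

has non-empty relative interior. Then the robust regression problem

$$\min_{x \in \mathbb{R}^m} \Big\{ \max_{\Delta A \in \mathcal{U}'} \|b - (A + \Delta A)x\|_a \Big\}$$

is equivalent to the following regularized regression problem

$$\min_{\lambda \in \mathbb{R}_+^k, \ \kappa \in \mathbb{R}_+^m, \ x \in \mathbb{R}^m} \Big\{ \|b - Ax\|_a + v(\lambda, \kappa, x) \Big\}; \quad \text{where: } v(\lambda, \kappa, x) \triangleq \max_{c \in \mathbb{R}^m} \Big[ (\kappa + |x|)^\top c - \sum_{j=1}^k \lambda_j f_j(c) \Big]. \tag{7}$$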



Remark: Problem (7) is efficiently solvable. Denote
$z_c(\lambda, \kappa, x) \triangleq \big[ (\kappa + |x|)^\top c - \sum_{j=1}^k \lambda_j f_j(c) \big]$.
This is a convex function of $(\lambda, \kappa, x)$, and the sub-gradient of $z_c(\cdot)$ can be computed easily for
any $c$. The function $v(\lambda, \kappa, x)$ is the maximum of a set of convex functions, $z_c(\cdot)$, hence is
convex, and satisfies

$$\partial v(\lambda^*, \kappa^*, x^*) = \partial z_{c_0}(\lambda^*, \kappa^*, x^*),$$

where $c_0$ maximizes $\big[ (\kappa^* + |x^*|)^\top c - \sum_{j=1}^k \lambda_j^* f_j(c) \big]$. We can efficiently evaluate $c_0$ due to
convexity of $f_j(\cdot)$, and hence we can efficiently evaluate the sub-gradient of $v(\cdot)$.

The next two corollaries are a direct application of Theorem 4.

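Corollary 1: Suppose $\mathcal{U}' = \big\{ (\delta_1, \cdots, \delta_m) \,\big|\, \big\| (\|\delta_1\|_a, \cdots, \|\delta_m\|_a) \big\|_s \le l \big\}$ for a symmetric norm
$\|\cdot\|_s$, then the resulting regularized regression problem is

$$\min_{x \in \mathbb{R}^m} \Big\{ \|b - Ax\|_a + l \|x\|_s^* \Big\}; \quad \text{where } \|\cdot\|_s^* \text{ is the dual norm of } \|\cdot\|_s.$$
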



This corollary interprets arbitrary norm-based regularizers from a robust regression perspective.
For example, it is straightforward to show that if we take both $\|\cdot\|_a$ and $\|\cdot\|_s$ as the Euclidean
norm, then $\mathcal{U}'$ is the set of matrices with their Frobenius norms bounded, and Corollary 1
reduces to the robust formulation introduced by [13].
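
As a further worked instance of Corollary 1 (added here for illustration, not taken from the paper), consider the assumption that $\|\cdot\|_s$ is the $\ell_\infty$ norm. Its dual is the $\ell_1$ norm, so the coupled set collapses to the uncoupled set of Theorem 3 with a common bound $c_i = l$, and the regularizer becomes the Lasso penalty:

```latex
\mathcal{U}' = \Big\{ (\delta_1, \cdots, \delta_m) \,\Big|\, \max_{1 \le i \le m} \|\delta_i\|_a \le l \Big\}
             = \Big\{ (\delta_1, \cdots, \delta_m) \,\Big|\, \|\delta_i\|_a \le l, \ i = 1, \cdots, m \Big\},
\qquad
\|x\|_\infty^* = \|x\|_1
\;\Longrightarrow\;
\min_{x \in \mathbb{R}^m} \Big\{ \|b - Ax\|_a + l \|x\|_1 \Big\}.
```

This agrees with Theorem 3 when every $c_i$ equals $l$, which is a useful consistency check between the coupled and uncoupled formulations.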



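Corollary 2: Suppose $\mathcal{U}' = \big\{ (\delta_1, \cdots, \delta_m) \,\big|\, \exists c \ge 0 : Tc \le s; \ \|\delta_j\|_a \le c_j \big\}$, then the
resulting regularized regression problem is

Minimize: $\|b - Ax\|_a + s^\top \lambda$

Subject to: $x \le T^\top \lambda$
$-x \le T^\top \lambda$
$\lambda \ge 0$.
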



Unlike previous results, this corollary considers general polytope uncertainty sets. Advantages 
of such sets include the linearity of the final formulation. Moreover, the modeling power is 
considerable, as many interesting disturbances can be modeled in this way. 

We briefly mention some further examples meant to illustrate the power and flexibility of the 
robust formulation. We refer the interested reader to [16] for full details. 

As the results above indicate, the robust formulation can model a broad class of uncertainties, 
and yield computationally tractable (i.e., convex) problems. In particular, one can use the polytope 
uncertainty discussed above, to show (see [16]) that by employing an uncertainty set first used 
in [17], we can model cardinality constrained noise, where some (unknown) subset of at most 
k features can be corrupted. 

Another avenue one may take using robustness, and which is also possible to solve easily, is 
the case where the uncertainty set allows independent perturbation of the columns and the rows 
of the matrix A. The resulting formulation resembles the elastic-net formulation [18], where 
there is a combination of $\ell_2$ and $\ell_1$ regularization.



IV. Sparsity 

In this section, we investigate the sparsity properties of robust regression (1), and equivalently
Lasso. Lasso's ability to recover sparse solutions has been extensively studied and discussed (cf.
[8]-[11]). There are generally two approaches. The first approach investigates the problem from a
statistical perspective. That is, it assumes that the observations are generated by a (sparse) linear 
combination of the features, and investigates the asymptotic or probabilistic conditions required 
for Lasso to correctly recover the generative model. The second approach treats the problem 
from an optimization perspective, and studies under what conditions a pair (A, b) defines a 
problem with sparse solutions (e.g., [19]). 

We follow the second approach and do not assume a generative model. Instead, we consider
the conditions that lead to a feature receiving zero weight. Our first result paves the way for
the remainder of this section. We show in Theorem 5 that, essentially, a feature receives no
weight (namely, $x_i^* = 0$) if there exists an allowable perturbation of that feature which makes it
irrelevant. This result holds for general norm loss functions, but in the $\ell_2$ case, we obtain further
geometric results. For instance, using Theorem 5, we show, among other results, that "nearly"
orthogonal features get zero weight (Theorem 6). Using similar tools, we provide additional
results in [16]. There, we show, among other results, that the sparsity pattern of any optimal
solution must satisfy certain angular separation conditions between the residual and the relevant
features, and that "nearly" linearly dependent features get zero weight.

Substantial research regarding sparsity properties of Lasso can be found in the literature (cf.
[8]-[11], [20]-[23] and many others). In particular, similar results as in point (a), that rely on
an incoherence property, have been established in, e.g., [19], and are used as standard tools 
in investigating sparsity of Lasso from the statistical perspective. However, a proof exploiting 
robustness and properties of the uncertainty is novel. Indeed, such a proof shows a fundamental 
connection between robustness and sparsity, and implies that robustifying w.r.t. a feature-wise 
independent uncertainty set might be a plausible way to achieve sparsity for other problems. 

To state the main theorem of this section, from which the other results derive, we introduce
some notation to facilitate the discussion. Given a feature-wise uncoupled uncertainty set, $\mathcal{U}$, an
index subset $I \subseteq \{1, \ldots, n\}$, and any $\Delta A \in \mathcal{U}$, let $\Delta A^I$ denote the element of $\mathcal{U}$ that equals
$\Delta A$ on each feature indexed by $i \in I$, and is zero elsewhere. Then, we can write any element
$\Delta A \in \mathcal{U}$ as $\Delta A^I + \Delta A^{I^c}$ (where $I^c = \{1, \ldots, n\} \setminus I$). Then we have the following theorem. We
note that the result holds for any norm loss function, but we state and prove it for the $\ell_2$ norm,
since the proof for other norms is identical.

Theorem 5: The robust regression problem

$$\min_{x \in \mathbb{R}^m} \Big\{ \max_{\Delta A \in \mathcal{U}} \|b - (A + \Delta A)x\|_2 \Big\}$$

has a solution supported on an index set $I$ if there exists some perturbation $\Delta \tilde{A}^{I^c} \in \mathcal{U}$ of the
features in $I^c$, such that the robust regression problem

$$\min_{x \in \mathbb{R}^m} \Big\{ \max_{\Delta A \in \mathcal{U}} \|b - (A + \Delta \tilde{A}^{I^c} + \Delta A^I)x\|_2 \Big\}$$

has a solution supported on the set $I$.

Thus, a robust regression has an optimal solution supported on a set $I$, if some allowable perturbation
of the features corresponding to the complement of $I$ makes them irrelevant. Theorem 5 is a special
case of the following theorem with $c_j = 0$ for all $j \notin I$:

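Theorem 5'. Let $x^*$ be an optimal solution of the robust regression problem:

$$\min_{x \in \mathbb{R}^m} \Big\{ \max_{\Delta A \in \mathcal{U}} \|b - (A + \Delta A)x\|_2 \Big\},$$

and let $I \subseteq \{1, \cdots, m\}$ be such that $x_j^* = 0 \ \forall j \notin I$. Let

$$\tilde{\mathcal{U}} \triangleq \Big\{ (\delta_1, \cdots, \delta_m) \,\Big|\, \|\delta_i\|_2 \le c_i, \ i \in I; \ \|\delta_j\|_2 \le c_j + l_j, \ j \notin I \Big\}.$$

Then, $x^*$ is an optimal solution of

$$\min_{x \in \mathbb{R}^m} \Big\{ \max_{\Delta A \in \tilde{\mathcal{U}}} \|b - (\tilde{A} + \Delta A)x\|_2 \Big\},$$

for any $\tilde{A}$ that satisfies $\|\tilde{a}_j - a_j\| \le l_j$ for $j \notin I$, and $\tilde{a}_i = a_i$ for $i \in I$.

Proof: Notice that

$$\max_{\Delta A \in \tilde{\mathcal{U}}} \|b - (\tilde{A} + \Delta A)x^*\|_2
= \max_{\Delta A \in \mathcal{U}} \|b - (\tilde{A} + \Delta A)x^*\|_2
= \max_{\Delta A \in \mathcal{U}} \|b - (A + \Delta A)x^*\|_2.$$

These equalities hold because for $j \notin I$, $x_j^* = 0$, hence the $j$th column of both $\tilde{A}$ and $\Delta A$ has
no effect on the residual.

For an arbitrary $x'$, we have

$$\max_{\Delta A \in \tilde{\mathcal{U}}} \|b - (\tilde{A} + \Delta A)x'\|_2 \ge \max_{\Delta A \in \mathcal{U}} \|b - (A + \Delta A)x'\|_2.$$

This is because $\|a_j - \tilde{a}_j\| \le l_j$ for $j \notin I$, and $a_i = \tilde{a}_i$ for $i \in I$. Hence, we have

$$\{ A + \Delta A \mid \Delta A \in \mathcal{U} \} \subseteq \{ \tilde{A} + \Delta A \mid \Delta A \in \tilde{\mathcal{U}} \}.$$

Finally, notice that

$$\max_{\Delta A \in \mathcal{U}} \|b - (A + \Delta A)x^*\|_2 \le \max_{\Delta A \in \mathcal{U}} \|b - (A + \Delta A)x'\|_2.$$

Therefore we have

$$\max_{\Delta A \in \tilde{\mathcal{U}}} \|b - (\tilde{A} + \Delta A)x^*\|_2 \le \max_{\Delta A \in \tilde{\mathcal{U}}} \|b - (\tilde{A} + \Delta A)x'\|_2.$$

Since this holds for arbitrary $x'$, we establish the theorem. ■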

We can interpret the result of this theorem by considering a generative model¹ $b = \sum_{i \in I} w_i a_i + \tilde{\xi}$,
where $I \subseteq \{1, \cdots, m\}$ and $\tilde{\xi}$ is a random variable, i.e., $b$ is generated by features belonging
to $I$. In this case, for a feature $j \notin I$, Lasso would assign zero weight as long as there exists a
perturbed value of this feature, such that the optimal regression assigned it zero weight.

When we consider $\ell_2$ loss, we can translate the condition of a feature being "irrelevant" into
a geometric condition, namely, orthogonality. We now use the result of Theorem 5 to show
that robust regression has a sparse solution as long as an incoherence-type property is satisfied.
This result is more in line with the traditional sparsity results, but we note that the geometric
reasoning is different, and ours is based on robustness. Indeed, we show that a feature receives
zero weight, if it is "nearly" (i.e., within an allowable perturbation) orthogonal to the signal, and
all relevant features.

¹ While we are not assuming generative models to establish the results, it is still interesting to see how these results can help
in a generative model setup.

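Theorem 6: Let $c_i = c$ for all $i$ and consider $\ell_2$ loss. If there exists $I \subseteq \{1, \cdots, m\}$ such
that for all $v \in \mathrm{span}(\{a_i, i \in I\} \cup \{b\})$, $\|v\| = 1$, we have $v^\top a_j \le c$, $\forall j \notin I$, then any optimal
solution $x^*$ satisfies $x_j^* = 0$, $\forall j \notin I$.

Proof: For $j \notin I$, let $a_j^{=}$ denote the projection of $a_j$ onto the span of $\{a_i, i \in I\} \cup \{b\}$,
and let $a_j^{+} \triangleq a_j - a_j^{=}$. Thus, we have $\|a_j^{=}\| \le c$. Let $\hat{A}$ be such that

$$\hat{a}_i = \begin{cases} a_i & i \in I; \\ a_i^{+} & i \notin I. \end{cases}$$

Now let

$$\hat{\mathcal{U}} \triangleq \{ (\delta_1, \cdots, \delta_m) \mid \|\delta_i\|_2 \le c, \ i \in I; \ \|\delta_j\|_2 = 0, \ j \notin I \}.$$

Consider the robust regression problem $\min_{\hat{x}} \big\{ \max_{\Delta A \in \hat{\mathcal{U}}} \|b - (\hat{A} + \Delta A)\hat{x}\|_2 \big\}$, which is equivalent
to $\min_{\hat{x}} \big\{ \|b - \hat{A}\hat{x}\|_2 + \sum_{i \in I} c |\hat{x}_i| \big\}$. Note that the $\hat{a}_j$, $j \notin I$, are orthogonal to the span of
$\{\hat{a}_i, i \in I\} \cup \{b\}$. Hence for any given $\hat{x}$, by changing $\hat{x}_j$ to zero for all $j \notin I$, the minimizing
objective does not increase.

Since $\|a_j - \hat{a}_j\| = \|a_j^{=}\| \le c \ \forall j \notin I$ (and recall that $\mathcal{U} = \{ (\delta_1, \cdots, \delta_m) \mid \|\delta_i\|_2 \le c, \ \forall i \}$),
applying Theorem 5 concludes the proof. ■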
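
The hypothesis of Theorem 6 can be checked directly, since the largest value of $v^\top a_j$ over unit vectors $v$ in a subspace is the norm of the projection of $a_j$ onto that subspace (the quantity $\|a_j^{=}\|$ in the proof). Below is a minimal sketch in plain Python; the toy columns, the vector `b`, and the function names are illustrative assumptions and not part of the paper.

```python
import math

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: an orthonormal basis of span(vectors)."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coef = dot(w, q)
            w = [wi - coef * qi for wi, qi in zip(w, q)]
        norm_w = math.sqrt(dot(w, w))
        if norm_w > tol:  # skip vectors already in the span
            basis.append([wi / norm_w for wi in w])
    return basis

def projection_norm(a, basis):
    """Norm of the orthogonal projection of a onto span(basis)."""
    return math.sqrt(sum(dot(a, q) ** 2 for q in basis))

def zero_weight_certified(columns, b, I, c):
    """True if every feature j outside I satisfies the condition of Theorem 6."""
    basis = orthonormal_basis([columns[i] for i in sorted(I)] + [b])
    return all(projection_norm(columns[j], basis) <= c
               for j in range(len(columns)) if j not in I)

# Toy data: feature 2 is almost orthogonal to features 0, 1 and to b.
columns = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.05, 0.0, 0.0, 1.0]]
b = [2.0, -1.0, 0.0, 0.0]

certified = zero_weight_certified(columns, b, I={0, 1}, c=0.1)
not_certified = zero_weight_certified(columns, b, I={0, 1}, c=0.01)
```

With $c = 0.1$ the projection of feature 2 has norm $0.05 \le c$, so the theorem guarantees it zero weight; with $c = 0.01$ the condition fails and the theorem gives no conclusion either way.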

V. Density Estimation and Consistency 

In this section, we investigate the robust linear regression formulation from a statistical 
perspective and rederive using only robustness properties that Lasso is asymptotically consistent. 
The basic idea of the consistency proof is as follows. We show that the robust optimization 
formulation can be seen to be the maximum error w.r.t. a class of probability measures. This 
class includes a kernel density estimator, and using this, we show that Lasso is consistent. 

A. Robust Optimization, Worst-case Expected Utility and Kernel Density Estimator 

In this subsection, we present some notions and intermediate results. In particular, we link 
a robust optimization formulation with a worst expected utility (w.r.t. a class of probability 
measures); we then briefly recall the definition of a kernel density estimator. Such results will 
be used in establishing the consistency of Lasso, as well as providing some additional insights 
on robust optimization. Proofs are postponed to the appendix. 

We first establish a general result on the equivalence between a robust optimization formulation 
and a worst-case expected utility: 

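Proposition 1: Given a function $h : \mathbb{R}^{m+1} \to \mathbb{R}$ and Borel sets $Z_1, \cdots, Z_n \subseteq \mathbb{R}^{m+1}$, let

$$\mathcal{P}_n \triangleq \Big\{ \mu \in \mathcal{P} \,\Big|\, \forall S \subseteq \{1, \cdots, n\} : \mu\Big( \bigcup_{i \in S} Z_i \Big) \ge |S|/n \Big\}.$$

The following holds

$$\frac{1}{n} \sum_{i=1}^n \sup_{(r_i, b_i) \in Z_i} h(r_i, b_i) = \sup_{\mu \in \mathcal{P}_n} \int_{\mathbb{R}^{m+1}} h(r, b) \, d\mu(r, b).$$
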



This leads to the following corollary for Lasso, which states that for a given x, the robust 
regression loss over the training data is equal to the worst-case expected generalization error. 
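
Corollary 3: Given $b \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times m}$, the following equation holds for any $x \in \mathbb{R}^m$,

$$\|b - Ax\|_2 + \sqrt{n} c_n \|x\|_1 + \sqrt{n} c_n = \sup_{\mu \in \hat{\mathcal{P}}(n)} \sqrt{ n \int_{\mathbb{R}^{m+1}} (b' - r'^\top x)^2 \, d\mu(r', b') }. \tag{8}$$

Here²,

$$\hat{\mathcal{P}}(n) \triangleq \bigcup_{\|\sigma\|_2 \le \sqrt{n} c_n; \ \forall i : \|\delta_i\|_2 \le \sqrt{n} c_n} \mathcal{P}_n(A, \Delta, b, \sigma);$$

$$\mathcal{P}_n(A, \Delta, b, \sigma) \triangleq \Big\{ \mu \in \mathcal{P} \,\Big|\, Z_i = [b_i - \sigma_i, b_i + \sigma_i] \times \prod_{j=1}^m [a_{ij} - \delta_{ij}, a_{ij} + \delta_{ij}]; \ \forall S \subseteq \{1, \cdots, n\} : \mu\Big( \bigcup_{i \in S} Z_i \Big) \ge |S|/n \Big\}.$$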

Remark 1: We briefly explain Corollary 3 to avoid possible confusion. Equation (8) is a
non-probabilistic equality. That is, it holds without any assumption (e.g., i.i.d. or generated by
certain distributions) on $b$ and $A$. It also does not involve any probabilistic operation such as
taking expectation on the left-hand side; instead, it is an equivalence relationship which holds for
an arbitrary set of samples. Notice that the right-hand side also depends on the samples since
$\hat{\mathcal{P}}(n)$ is defined through $A$ and $b$. Indeed, $\hat{\mathcal{P}}(n)$ represents the union of classes of distributions
$\mathcal{P}_n(A, \Delta, b, \sigma)$ such that the norm of each column of $\Delta$ is bounded, where $\mathcal{P}_n(A, \Delta, b, \sigma)$ is
the set of distributions corresponding to (see Proposition 1) disturbance in hyper-rectangle Borel
sets $Z_1, \cdots, Z_n$ centered at $(b_i, r_i^\top)$ with lengths $(2\sigma_i, 2\delta_{i1}, \cdots, 2\delta_{im})$.

We will later show that $\hat{\mathcal{P}}(n)$ contains a kernel density estimator. Hence we recall here its
definition. The kernel density estimator for a density $h$ in $\mathbb{R}^d$, originally proposed in [24], [25],
is defined by

$$h_n(x) = (n c_n^d)^{-1} \sum_{i=1}^n K\Big( \frac{x - \hat{x}_i}{c_n} \Big),$$

where $\{c_n\}$ is a sequence of positive numbers, $\hat{x}_i$ are i.i.d. samples generated according to $h$,
and $K$ is a Borel measurable function (kernel) satisfying $K \ge 0$, $\int K = 1$. See [26], [27] and
the references therein for detailed discussions. Figure 1 illustrates a kernel density estimator using
a Gaussian kernel for a randomly generated sample-set. A celebrated property of a kernel density
estimator is that it converges in $L^1$ to $h$ when $c_n \downarrow 0$ and $n c_n^d \uparrow \infty$ [26].
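
The definition of the kernel density estimator above translates directly into code. The following is a minimal sketch in plain Python, using the uniform box kernel $K(u) = \mathbf{1}\{u \in [-1, 1]^d\} / 2^d$ (the same kernel that appears later in the consistency proof); the sample values and function names are illustrative assumptions.

```python
def box_kernel(u):
    """K(u) = 1{u in [-1, 1]^d} / 2^d: nonnegative and integrates to one."""
    d = len(u)
    inside = all(-1.0 <= ui <= 1.0 for ui in u)
    return 1.0 / 2 ** d if inside else 0.0

def kernel_density(x, samples, c_n, kernel=box_kernel):
    """h_n(x) = (n c_n^d)^{-1} * sum_i K((x - x_i) / c_n)."""
    n, d = len(samples), len(x)
    total = sum(kernel([(xj - sj) / c_n for xj, sj in zip(x, s)]) for s in samples)
    return total / (n * c_n ** d)

# Two one-dimensional samples and bandwidth c_n = 0.5.
samples = [[0.0], [1.0]]
value = kernel_density([0.25], samples, c_n=0.5)

# Riemann-sum check that the estimate integrates to (approximately) one.
step = 0.001
mass = step * sum(kernel_density([-1.0 + step * k], samples, 0.5) for k in range(3000))
```

At $x = 0.25$ only the sample at $0$ lies within one bandwidth, so `value` equals $0.5$, and `mass` is close to one as required of a density.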

B. Consistency of Lasso 

We restrict our discussion to the case where the magnitude of the allowable uncertainty for all 
features equals c, (i.e., the standard Lasso) and establish the statistical consistency of Lasso from 
a distributional robustness argument. Generalization to the non-uniform case is straightforward. 
Throughout, we use c n to represent c where there are n samples (we take c n to zero). 

² Recall that $a_{ij}$ is the $j$th element of $r_i$.

Fig. 1. Illustration of Kernel Density Estimation.

Recall the standard generative model in statistical learning: let $\mathbb{P}$ be a probability measure
with bounded support that generates i.i.d. samples $(b_i, r_i)$, and has a density $f^*(\cdot)$. Denote the
set of the first $n$ samples by $\mathcal{S}_n$. Define



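$$\begin{aligned}
x(c_n, \mathcal{S}_n) &\triangleq \arg\min_x \Big\{ \sqrt{ \frac{1}{n} \sum_{i=1}^n (b_i - r_i^\top x)^2 } + c_n \|x\|_1 \Big\} \\
&= \arg\min_x \Big\{ \frac{\sqrt{n}}{n} \sqrt{ \sum_{i=1}^n (b_i - r_i^\top x)^2 } + c_n \|x\|_1 \Big\}; \\
x(\mathbb{P}) &\triangleq \arg\min_x \Big\{ \sqrt{ \int_{b, r} (b - r^\top x)^2 \, d\mathbb{P}(b, r) } \Big\}.
\end{aligned}$$
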



In words, $x(c_n, \mathcal{S}_n)$ is the solution to Lasso with the tradeoff parameter set to $c_n \sqrt{n}$, and $x(\mathbb{P})$ is
the "true" optimal solution. We have the following consistency result. The theorem itself is a
well-known result. However, the proof technique is novel. This technique is of interest because the
standard techniques to establish consistency in statistical learning including Vapnik-Chervonenkis
(VC) dimension (e.g., [28]) and algorithmic stability (e.g., [29]) often work for a limited range
of algorithms, e.g., the $k$-Nearest Neighbor is known to have infinite VC dimension, and we
show in Section VI that Lasso is not stable. In contrast, a much wider range of algorithms have
robustness interpretations, allowing a unified approach to prove their consistency.
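
The statement that $x(c_n, \mathcal{S}_n)$ solves Lasso with tradeoff parameter $c_n \sqrt{n}$ follows because multiplying the empirical objective by $\sqrt{n}$ gives $\|b - Ax\|_2 + c_n \sqrt{n} \|x\|_1$, which has the same minimizer. A minimal sketch in plain Python that checks this scaling on a few candidate vectors (the data and function names are illustrative assumptions, not from the paper):

```python
import math

def residual_sq(rows, b, x):
    """sum_i (b_i - r_i^T x)^2, where rows[i] is the sample r_i."""
    return sum((bi - sum(rj * xj for rj, xj in zip(r, x))) ** 2
               for r, bi in zip(rows, b))

def empirical_objective(rows, b, x, c_n):
    """sqrt((1/n) sum_i (b_i - r_i^T x)^2) + c_n ||x||_1."""
    n = len(rows)
    return math.sqrt(residual_sq(rows, b, x) / n) + c_n * sum(abs(xj) for xj in x)

def norm_lasso_objective(rows, b, x, tradeoff):
    """||b - A x||_2 + tradeoff * ||x||_1 (the non-squared form of Theorem 1)."""
    return math.sqrt(residual_sq(rows, b, x)) + tradeoff * sum(abs(xj) for xj in x)

rows = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [2.0, 2.0]]
b = [1.0, 0.5, -2.0, 3.0]
c_n = 0.2
n = len(rows)

candidates = [[0.0, 0.0], [0.5, -0.3], [-1.0, 2.0]]
ratios = [norm_lasso_objective(rows, b, x, c_n * math.sqrt(n))
          / empirical_objective(rows, b, x, c_n)
          for x in candidates]
```

Every entry of `ratios` equals $\sqrt{n}$, so the two objectives differ by a constant positive factor and share their minimizers.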

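Theorem 7: Let $\{c_n\}$ be such that $c_n \downarrow 0$ and $\lim_{n \to \infty} n (c_n)^{m+1} = \infty$. Suppose there exists a
constant $H$ such that $\|x(c_n, \mathcal{S}_n)\|_2 \le H$. Then,

$$\lim_{n \to \infty} \sqrt{ \int_{b, r} (b - r^\top x(c_n, \mathcal{S}_n))^2 \, d\mathbb{P}(b, r) } = \sqrt{ \int_{b, r} (b - r^\top x(\mathbb{P}))^2 \, d\mathbb{P}(b, r) },$$

almost surely.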

Proof: Step 1: We show that the right-hand side of the equation in Corollary 3 includes a kernel density estimator for the true (unknown) distribution. Consider the following kernel estimator given samples $\mathcal{S}_n = (b_i, r_i)_{i=1}^n$ and tradeoff parameter $c_n$,

$$
f_n(b, r) \triangleq (n c_n^{m+1})^{-1} \sum_{i=1}^n K\Big(\frac{b - b_i,\; r - r_i}{c_n}\Big), \quad \text{where: } K(x) \triangleq I_{[-1,+1]^{m+1}}(x) / 2^{m+1}. \qquad (9)
$$

Let $\hat{\mu}_n$ denote the distribution given by the density function $f_n(b, r)$. It is easy to check that $\hat{\mu}_n$ belongs to $\mathcal{P}_n(A, (c_n \mathbf{1}_n, \cdots, c_n \mathbf{1}_n), b, c_n \mathbf{1}_n)$ and hence belongs to $\hat{\mathcal{P}}(n)$ by definition.
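The estimator in Equation (9) is a box-kernel (uniform-kernel) density estimator with bandwidth $c_n$. As an illustration only, the sketch below evaluates it at a single point; the function name and the representation of each sample as a flat tuple $(b_i, r_{i1}, \ldots, r_{im})$ are assumptions made for this example.

```python
def box_kernel_density(samples, c_n, point):
    """Evaluate f_n(point) for the box kernel K = I_{[-1,1]^d} / 2^d.

    samples: list of tuples (b_i, r_i1, ..., r_im), each of length d = m + 1
    c_n:     bandwidth (the tradeoff parameter in the text), must be > 0
    point:   tuple (b, r_1, ..., r_m) of the same length d
    """
    n = len(samples)
    d = len(point)
    # A sample contributes iff the point lies in the sup-norm box of
    # half-width c_n centred at that sample.
    hits = sum(
        1 for s in samples
        if all(abs(p - q) <= c_n for p, q in zip(point, s))
    )
    # (n * c_n^d)^{-1} * hits / 2^d  ==  hits / (n * (2 c_n)^d)
    return hits / (n * (2.0 * c_n) ** d)
```

Each sample spreads mass $1/n$ uniformly over the box of half-width $c_n$ around it, which is why the resulting distribution assigns mass at least $|S|/n$ to the union of the boxes indexed by any subset $S$ of samples.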

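Step 2: Using the $L^1$ convergence property of the kernel density estimator, we prove the consistency of robust regression and equivalently Lasso.

First notice that $\|x(c_n, \mathcal{S}_n)\|_2 \le H$ and the bounded support of $\mathbb{P}$ imply that there exists a universal constant $C$ such that

$$
\max_{b,r} (b - r^\top x(c_n, \mathcal{S}_n))^2 \le C.
$$

By Corollary 3 and $\hat{\mu}_n \in \hat{\mathcal{P}}(n)$ we have

$$
\begin{aligned}
& \sqrt{\int_{b,r} (b - r^\top x(c_n, \mathcal{S}_n))^2 \, d\hat{\mu}_n(b, r)} \\
\le\; & \sup_{\mu \in \hat{\mathcal{P}}(n)} \sqrt{\int_{b,r} (b - r^\top x(c_n, \mathcal{S}_n))^2 \, d\mu(b, r)} \\
=\; & \frac{\sqrt{n}}{n} \sqrt{\sum_{i=1}^n (b_i - r_i^\top x(c_n, \mathcal{S}_n))^2} + c_n \|x(c_n, \mathcal{S}_n)\|_1 + c_n \\
\le\; & \frac{\sqrt{n}}{n} \sqrt{\sum_{i=1}^n (b_i - r_i^\top x(\mathbb{P}))^2} + c_n \|x(\mathbb{P})\|_1 + c_n,
\end{aligned}
$$

where the last inequality holds by definition of $x(c_n, \mathcal{S}_n)$.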
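Taking the square of both sides, we have

$$
\begin{aligned}
\int_{b,r} (b - r^\top x(c_n, \mathcal{S}_n))^2 \, d\hat{\mu}_n(b, r)
\le\; & \frac{1}{n} \sum_{i=1}^n (b_i - r_i^\top x(\mathbb{P}))^2 + c_n^2 (1 + \|x(\mathbb{P})\|_1)^2 \\
& + 2 c_n (1 + \|x(\mathbb{P})\|_1) \sqrt{\frac{1}{n} \sum_{i=1}^n (b_i - r_i^\top x(\mathbb{P}))^2}.
\end{aligned}
$$

Notice that the right-hand side converges to $\int_{b,r} (b - r^\top x(\mathbb{P}))^2 \, d\mathbb{P}(b, r)$ as $n \uparrow \infty$ and $c_n \downarrow 0$ almost surely. Furthermore, we have

$$
\begin{aligned}
& \int_{b,r} (b - r^\top x(c_n, \mathcal{S}_n))^2 \, d\mathbb{P}(b, r) \\
\le\; & \int_{b,r} (b - r^\top x(c_n, \mathcal{S}_n))^2 \, d\hat{\mu}_n(b, r) + \Big[\max_{b,r} (b - r^\top x(c_n, \mathcal{S}_n))^2\Big] \int_{b,r} |f_n(b, r) - f^*(b, r)| \, d(b, r) \\
\le\; & \int_{b,r} (b - r^\top x(c_n, \mathcal{S}_n))^2 \, d\hat{\mu}_n(b, r) + C \int_{b,r} |f_n(b, r) - f^*(b, r)| \, d(b, r),
\end{aligned}
$$

where the last inequality follows from the definition of $C$. Notice that $\int_{b,r} |f_n(b, r) - f^*(b, r)| \, d(b, r)$ goes to zero almost surely when $c_n \downarrow 0$ and $n c_n^{m+1} \uparrow \infty$, since $f_n(\cdot)$ is a kernel density estimation of $f^*(\cdot)$ (see e.g. Theorem 3.1 of [26]). Hence the theorem follows. ■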

We can remove the assumption that $\|x(c_n, \mathcal{S}_n)\|_2 \le H$; as in Theorem 7, the proof technique rather than the result itself is of interest.

Theorem 8: Let $\{c_n\}$ converge to zero sufficiently slowly. Then

$$
\lim_{n\to\infty} \sqrt{\int_{b,r} (b - r^\top x(c_n, \mathcal{S}_n))^2 \, d\mathbb{P}(b, r)} = \sqrt{\int_{b,r} (b - r^\top x(\mathbb{P}))^2 \, d\mathbb{P}(b, r)},
$$

almost surely.

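Proof: To prove the theorem, we need to consider a set of distributions belonging to $\hat{\mathcal{P}}(n)$. Hence we establish the following lemma first.

Lemma 1: Partition the support of $\mathbb{P}$ as $V_1, \cdots, V_T$ such that the $\ell^\infty$ radius of each set is less than $c_n$. If a distribution $\mu$ satisfies

$$
\mu(V_t) = \#\{i \mid (b_i, r_i) \in V_t\} / n; \quad t = 1, \cdots, T, \qquad (10)
$$

then $\mu \in \hat{\mathcal{P}}(n)$.

Proof: Let $Z_i = [b_i - c_n, b_i + c_n] \times \prod_{j=1}^m [a_{ij} - c_n, a_{ij} + c_n]$; recall that $a_{ij}$ is the $j$-th element of $r_i$. Since $V_t$ has $\ell^\infty$ radius less than $c_n$, we have

$$
(b_i, r_i) \in V_t \;\Longrightarrow\; V_t \subseteq Z_i.
$$

Therefore, for any $S \subseteq \{1, \cdots, n\}$, the following holds

$$
\mu\Big(\bigcup_{i \in S} Z_i\Big) \ge \mu\Big(\bigcup \{V_t \mid \exists i \in S: (b_i, r_i) \in V_t\}\Big)
= \sum_{t \mid \exists i \in S: (b_i, r_i) \in V_t} \mu(V_t)
= \sum_{t \mid \exists i \in S: (b_i, r_i) \in V_t} \#\big((b_i, r_i) \in V_t\big) / n \ge |S| / n.
$$

Hence $\mu \in \mathcal{P}_n(A, \Delta, b, c_n)$ where each element of $\Delta$ is $c_n$, which leads to $\mu \in \hat{\mathcal{P}}(n)$. ■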
Now we proceed to prove the theorem. Partition the support of $\mathbb{P}$ into $T$ subsets such that the $\ell^\infty$ radius of each one is smaller than $c_n$. Denote by $\tilde{\mathcal{P}}(n)$ the set of probability measures satisfying Equation (10). Hence $\tilde{\mathcal{P}}(n) \subseteq \hat{\mathcal{P}}(n)$ by Lemma 1. Further notice that there exists a universal constant $K$ such that $\|x(c_n, \mathcal{S}_n)\|_2 \le K / c_n$, due to the fact that the square loss of the solution $x = 0$ is bounded by a constant that depends only on the support of $\mathbb{P}$. Thus, there exists a constant $C$ such that $\max_{b,r} (b - r^\top x(c_n, \mathcal{S}_n))^2 \le C / c_n^2$.

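Following a similar argument as in the proof of Theorem 7, we have

$$
\begin{aligned}
\sup_{\mu_n \in \tilde{\mathcal{P}}(n)} \int_{b,r} (b - r^\top x(c_n, \mathcal{S}_n))^2 \, d\mu_n(b, r)
\le\; & \frac{1}{n} \sum_{i=1}^n (b_i - r_i^\top x(\mathbb{P}))^2 + c_n^2 (1 + \|x(\mathbb{P})\|_1)^2 \\
& + 2 c_n (1 + \|x(\mathbb{P})\|_1) \sqrt{\frac{1}{n} \sum_{i=1}^n (b_i - r_i^\top x(\mathbb{P}))^2}, \qquad (11)
\end{aligned}
$$

and

$$
\begin{aligned}
& \int_{b,r} (b - r^\top x(c_n, \mathcal{S}_n))^2 \, d\mathbb{P}(b, r) \\
\le\; & \inf_{\mu_n \in \tilde{\mathcal{P}}(n)} \Big\{ \int_{b,r} (b - r^\top x(c_n, \mathcal{S}_n))^2 \, d\mu_n(b, r) + \max_{b,r} (b - r^\top x(c_n, \mathcal{S}_n))^2 \int_{b,r} |f_{\mu_n}(b, r) - f^*(b, r)| \, d(b, r) \Big\} \\
\le\; & \sup_{\mu_n \in \tilde{\mathcal{P}}(n)} \int_{b,r} (b - r^\top x(c_n, \mathcal{S}_n))^2 \, d\mu_n(b, r) + \frac{2C}{c_n^2} \inf_{\mu'_n \in \tilde{\mathcal{P}}(n)} \Big\{ \int_{b,r} |f_{\mu'_n}(b, r) - f^*(b, r)| \, d(b, r) \Big\},
\end{aligned}
$$

where $f_\mu$ stands for the density function of a measure $\mu$ and $f^*$ is the density of $\mathbb{P}$. Notice that $\tilde{\mathcal{P}}(n)$ is the set of distributions satisfying Equation (10), hence $\inf_{\mu'_n \in \tilde{\mathcal{P}}(n)} \int_{b,r} |f_{\mu'_n}(b, r) - f^*(b, r)| \, d(b, r)$ is upper-bounded by $\sum_{t=1}^T |\mathbb{P}(V_t) - \#((b_i, r_i) \in V_t) / n|$, which goes to zero as $n$ increases for any fixed $c_n$ (see for example Proposition A6.6 of [30]). Therefore,

$$
\frac{2C}{c_n^2} \inf_{\mu'_n \in \tilde{\mathcal{P}}(n)} \Big\{ \int_{b,r} |f_{\mu'_n}(b, r) - f^*(b, r)| \, d(b, r) \Big\} \to 0,
$$

if $c_n \downarrow 0$ sufficiently slowly. Combining this with Inequality (11) proves the theorem. ■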

VI. Stability 

Knowing that the robust regression problem (OQ) and in particular Lasso encourage sparsity, 
it is of interest to investigate another desirable characteristic of a learning algorithm, namely, 
stability. We show in this section that Lasso is not stable. This is a special case of a more general 
result we prove in [31], where we show that this is a common property for all algorithms that 
encourage sparsity. That is, if a learning algorithm achieves a certain sparsity condition, then it
cannot have a non-trivial stability bound.

We recall the definition of uniform stability [29] first. We let $\mathcal{Z}$ denote the space of points and labels (typically this will be a compact subset of $\mathbb{R}^{n+1}$) so that $S \in \mathcal{Z}^m$ denotes a collection of $m$ labelled training points. We let $\mathbb{L}$ denote a learning algorithm, and for $S \in \mathcal{Z}^m$, we let $\mathbb{L}_S$ denote the output of the learning algorithm (i.e., the regression function it has learned from the training data). Then given a loss function $l$, and a labeled point $s = (z, b) \in \mathcal{Z}$, we let $l(\mathbb{L}_S, s)$ denote the loss of the algorithm that has been trained on the set $S$, on the data point $s$. Thus for squared loss, we would have $l(\mathbb{L}_S, s) = \|\mathbb{L}_S(z) - b\|^2$.

Definition 1: An algorithm $\mathbb{L}$ has uniform stability bound of $\beta_m$ with respect to the loss function $l$ if the following holds

$$
\forall S \in \mathcal{Z}^m, \; \forall i \in \{1, \cdots, m\}, \quad \|l(\mathbb{L}_S, \cdot) - l(\mathbb{L}_{S^{\setminus i}}, \cdot)\|_\infty \le \beta_m.
$$

Here $\mathbb{L}_{S^{\setminus i}}$ stands for the learned solution with the $i$-th sample removed from $S$.

At first glance, this definition may seem too stringent for any reasonable algorithm to exhibit good stability properties. However, as shown in [29], Tikhonov-regularized regression has stability that scales as $1/m$. Stability that scales at least as fast as $o(1/\sqrt{m})$ can be used to establish strong PAC bounds (see [29]).
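To make Definition 1 concrete, the leave-one-out quantity inside the definition can be evaluated for one fixed training set. The sketch below is an illustration under stated assumptions and is not taken from [29]: it uses a one-dimensional Tikhonov-regularized least-squares estimator without intercept, minimizing $\frac{1}{m}\sum_i (b_i - z_i x)^2 + \lambda x^2$, and replaces the supremum over all test points by a maximum over a finite list. All function names are hypothetical. Since Definition 1 takes the worst case over every training set and every point, the returned value is only a lower estimate of $\beta_m$.

```python
def ridge_1d(samples, lam):
    # Closed-form minimizer of (1/m) * sum_i (b_i - z_i x)^2 + lam * x^2.
    # samples: list of (z_i, b_i) pairs; lam > 0 is assumed.
    m = len(samples)
    if m == 0:
        return 0.0  # convention for the empty training set
    num = sum(z * b for z, b in samples)
    den = sum(z * z for z, _ in samples) + lam * m
    return num / den

def squared_loss(x, point):
    # Loss of the learned coefficient x on a labelled point (z, b).
    z, b = point
    return (z * x - b) ** 2

def leave_one_out_gap(samples, test_points, lam):
    # max over i and over the listed test points of
    # |l(L_S, s) - l(L_{S without i}, s)|
    x_full = ridge_1d(samples, lam)
    gap = 0.0
    for i in range(len(samples)):
        x_loo = ridge_1d(samples[:i] + samples[i + 1:], lam)
        for s in test_points:
            diff = abs(squared_loss(x_full, s) - squared_loss(x_loo, s))
            gap = max(gap, diff)
    return gap
```

For instance, with the two samples $(1, 1)$ and $(1, -1)$ and $\lambda = 1$, the full-sample solution is $0$ while each leave-one-out solution is $\pm 0.5$, so the gap on the test point $(1, 0)$ is $0.25$.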

In this section we show that not only is the stability (in the sense defined above) of Lasso much worse than the stability of $\ell^2$-regularized regression, but in fact Lasso's stability is, in the following sense, as bad as it gets. To this end, we define the notion of the trivial bound, which is the worst possible error a training algorithm can have for an arbitrary training set and a testing sample labelled by zero.

Definition 2: Given a subset from which we can draw $m$ labelled points, $\mathcal{Z} \subseteq \mathbb{R}^{n \times (m+1)}$, and a subset for one unlabelled point, $\mathcal{X} \subseteq \mathbb{R}^m$, a trivial bound for a learning algorithm $\mathbb{L}$ w.r.t. $\mathcal{Z}$ and $\mathcal{X}$ is

$$
\mathfrak{b}(\mathbb{L}, \mathcal{Z}, \mathcal{X}) \triangleq \max_{S \in \mathcal{Z},\, z \in \mathcal{X}} l\big(\mathbb{L}_S, (z, 0)\big).
$$

As above, $l(\cdot, \cdot)$ is a given loss function.

Notice that the trivial bound does not diminish as the number of samples increases, since by 
repeatedly choosing the worst sample, the algorithm will yield the same solution. 

Now we show that the uniform stability bound of Lasso can be no better than its trivial bound 
with the number of features halved. 

Theorem 9: Let $\hat{\mathcal{Z}} \subseteq \mathbb{R}^{n \times (2m+1)}$ be the domain of the sample set and $\hat{\mathcal{X}} \subseteq \mathbb{R}^{2m}$ be the domain of the new observation, such that

$$
(b, A) \in \mathcal{Z} \;\Longrightarrow\; (b, A, A) \in \hat{\mathcal{Z}},
$$

$$
(z^\top) \in \mathcal{X} \;\Longrightarrow\; (z^\top, z^\top) \in \hat{\mathcal{X}}.
$$

Then the uniform stability bound of Lasso is lower bounded by $\mathfrak{b}(\text{Lasso}, \mathcal{Z}, \mathcal{X})$.

Proof: Let $(b^*, A^*)$ and $(0, z^{*\top})$ be the sample set and the new observation such that they jointly achieve $\mathfrak{b}(\text{Lasso}, \mathcal{Z}, \mathcal{X})$, and let $x^*$ be the optimal solution to Lasso w.r.t. $(b^*, A^*)$. Consider the following sample set

$$
\begin{pmatrix} b^* & A^* & A^* \\ 0 & 0^\top & z^{*\top} \end{pmatrix}.
$$

Observe that $(x^{*\top}, 0^\top)^\top$ is an optimal solution of Lasso w.r.t. this sample set. Now remove the last sample from the sample set. Notice that $(0^\top, x^{*\top})^\top$ is an optimal solution for this new sample set. Using the last sample as a testing observation, the solution w.r.t. the full sample set has zero cost, while the solution of the leave-one-out sample set has a cost $\mathfrak{b}(\text{Lasso}, \mathcal{Z}, \mathcal{X})$. This proves the theorem. ■
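The duplicated-feature construction used in this proof can be checked numerically. The sketch below is an illustration only and does not solve Lasso: `x` is an arbitrary coefficient vector standing in for $x^*$, the objective is assumed to take the form $\|b - Ax\|_2 + c\|x\|_1$, and all function names are hypothetical. It verifies that $(x^\top, 0^\top)^\top$ on the full sample set and $(0^\top, x^\top)^\top$ on the leave-one-out sample set attain the same objective value as `x` on the original data, while their predictions on the held-out sample differ.

```python
import math

def lasso_objective(A, b, x, c):
    # ||b - A x||_2 + c * ||x||_1, with A given as a list of rows.
    res = [bi - sum(a * v for a, v in zip(row, x)) for row, bi in zip(A, b)]
    return math.sqrt(sum(r * r for r in res)) + c * sum(abs(v) for v in x)

def predict(z, x):
    # Linear prediction z^T x.
    return sum(a * v for a, v in zip(z, x))

def instability_demo(A, b, x, z, c):
    m = len(x)
    zeros = [0.0] * m
    doubled = [row + row for row in A]       # rows of (A, A)
    last_row = zeros + list(z)               # features (0^T, z^T), label 0
    full_A, full_b = doubled + [last_row], list(b) + [0.0]
    left, right = list(x) + zeros, zeros + list(x)
    return {
        "base": lasso_objective(A, b, x, c),
        "full_left": lasso_objective(full_A, full_b, left, c),
        "loo_right": lasso_objective(doubled, b, right, c),
        "pred_left": predict(last_row, left),    # prediction of (x, 0)
        "pred_right": predict(last_row, right),  # prediction of (0, x)
    }
```

Because the held-out sample has label zero, the squared loss of $(x^\top, 0^\top)^\top$ on it is zero, whereas the loss of $(0^\top, x^\top)^\top$ is $(z^\top x)^2$, which is the quantity the trivial bound maximizes.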
VII. Conclusion

In this paper, we considered robust regression with a least-square-error loss. In contrast to previous work on robust regression, we considered the case where the perturbations of the observations are in the features. We showed that this formulation is equivalent to a weighted $\ell_1$-norm regularized regression problem if no correlation of disturbances among different features is allowed, and hence provided an interpretation of the widely used Lasso algorithm from a robustness perspective. We also formulated tractable robust regression problems for disturbances coupled among different features, and hence generalized Lasso to a wider class of regularization schemes.

The sparsity and consistency of Lasso were also investigated based on its robustness interpretation. In particular, we presented a "no-free-lunch" theorem saying that sparsity and algorithmic stability contradict each other. This result shows that, although sparsity and algorithmic stability are both regarded as desirable properties of regression algorithms, it is not possible to achieve them simultaneously, and we have to trade off these two properties in designing a regression algorithm.

The main thrust of this work is to treat the widely used regularized regression scheme from a robust optimization perspective, and to extend the result of [13] (i.e., Tikhonov regularization is equivalent to a robust formulation for a Frobenius-norm-bounded disturbance set) to a broader range of disturbance sets and hence regularization schemes. This not only provides new insight into why regularization schemes work, but also offers solid motivation for selecting the regularization parameter of existing regularization schemes and facilitates the design of new ones.

References 

[1] L. Elden. Perturbation theory for the least-square problem with linear equality constraints. BIT, 24:472-476, 1985.
[2] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 1989.
[3] D. Higham and N. Higham. Backward error and condition of structured linear systems. SIAM Journal on Matrix Analysis and Applications, 13:162-175, 1992.
[4] R. Fierro and J. Bunch. Collinearity and total least squares. SIAM Journal on Matrix Analysis and Applications, 15:1167-1181, 1994.
[5] A. Tikhonov and V. Arsenin. Solutions of Ill-Posed Problems. Wiley, New York, 1977.
[6] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.
[7] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407-499, 2004.
[8] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33-61, 1998.
[9] A. Feuer and A. Nemirovski. On sparse representation in pairs of bases. IEEE Transactions on Information Theory, 49(6):1579-1581, 2003.
[10] E. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489-509, 2006.
[11] J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231-2242, 2004.
[12] M. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using $\ell_1$-constrained quadratic programming. Technical report, available from http://www.stat.berkeley.edu/tech-reports/709.pdf, Department of Statistics, UC Berkeley, 2006.
[13] L. El Ghaoui and H. Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18:1035-1064, 1997.
[14] P. Shivaswamy, C. Bhattacharyya, and A. Smola. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7:1283-1314, July 2006.
[15] D. Bertsimas and I. Popescu. Optimal inequalities in probability theory: A convex optimization approach. SIAM Journal on Optimization, 15(3):780-800, 2004.
[16] H. Xu, C. Caramanis, and S. Mannor. Robust regression and Lasso. Technical report, Gerad, available from http://www.cim.mcgill.ca/~xuhuan/LassoGerad.pdf, 2008.



November 11, 2008 






[17] D. Bertsimas and M. Sim. The price of robustness. Operations Research, 52(1):35-53, January 2004.
[18] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301-320, 2005.
[19] J. Tropp. Just relax: Convex programming methods for identifying sparse signals. IEEE Transactions on Information Theory, 51(3):1030-1051, 2006.
[20] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6):1445-1480, 1998.
[21] R. R. Coifman and M. V. Wickerhauser. Entropy-based algorithms for best-basis selection. IEEE Transactions on Information Theory, 38(2):713-718, 1992.
[22] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.
[23] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, 2006.
[24] M. Rosenblatt. Remarks on some nonparametric estimates of a density function. Annals of Mathematical Statistics, 27:832-837, 1956.
[25] E. Parzen. On the estimation of a probability density function and the mode. Annals of Mathematical Statistics, 33:1065-1076, 1962.
[26] L. Devroye and L. Gyorfi. Nonparametric Density Estimation: The L1 View. John Wiley & Sons, 1985.
[27] D. Scott. Multivariate Density Estimation: Theory, Practice and Visualization. John Wiley & Sons, 1992.
[28] V. Vapnik and A. Chervonenkis. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3):260-284, 1991.
[29] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499-526, 2002.
[30] A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 2000.
[31] H. Xu, C. Caramanis, and S. Mannor. Sparse algorithms are not stable: A no-free-lunch theorem. In Proceedings of Forty-Sixth Allerton Conference on Communication, Control, and Computing, 2008.



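Appendix A
Proof of Theorem 2

Theorem 2: Consider a random vector $v \in \mathbb{R}^n$, such that $E(v) = a$, and $E(vv^\top) = \Sigma$, $\Sigma \succeq 0$. Then we have

$$
\Pr\{\|v\|_2 \ge c_i\} \le \left\{
\begin{array}{ll}
\min_{P, q, r, \lambda} & \mathrm{Trace}(\Sigma P) + 2 q^\top a + r \\
\text{subject to:} & \begin{pmatrix} P & q \\ q^\top & r \end{pmatrix} \succeq 0 \\
& \begin{pmatrix} I(m) & 0 \\ 0^\top & -c_i^2 \end{pmatrix} \preceq \lambda \begin{pmatrix} P & q \\ q^\top & r - 1 \end{pmatrix} \\
& \lambda \ge 0.
\end{array}
\right. \qquad (12)
$$

Proof: Consider a function $f(\cdot)$ parameterized by $P, q, r$ defined as $f(v) = v^\top P v + 2 q^\top v + r$. Notice $E(f(v)) = \mathrm{Trace}(\Sigma P) + 2 q^\top a + r$. Now we show that $f(v) \ge \mathbf{1}_{\|v\|_2 \ge c_i}$ for all $P, q, r$ satisfying the constraints in (12).

To show $f(v) \ge \mathbf{1}_{\|v\|_2 \ge c_i}$, we need to establish (i) $f(v) \ge 0$ for all $v$, and (ii) $f(v) \ge 1$ when $\|v\|_2 \ge c_i$. Notice that

$$
f(v) = \begin{pmatrix} v \\ 1 \end{pmatrix}^\top \begin{pmatrix} P & q \\ q^\top & r \end{pmatrix} \begin{pmatrix} v \\ 1 \end{pmatrix},
$$

hence (i) holds because

$$
\begin{pmatrix} P & q \\ q^\top & r \end{pmatrix} \succeq 0.
$$

To establish condition (ii), it suffices to show that $v^\top v \ge c_i^2$ implies $v^\top P v + 2 q^\top v + r \ge 1$, which is equivalent to showing $\{v \mid v^\top P v + 2 q^\top v + r - 1 \le 0\} \subseteq \{v \mid v^\top v \le c_i^2\}$. Noticing that this is an ellipsoid-containment condition, by the S-Procedure we see that it is equivalent to the condition that there exists a $\lambda \ge 0$ such that

$$
\begin{pmatrix} I(m) & 0 \\ 0^\top & -c_i^2 \end{pmatrix} \preceq \lambda \begin{pmatrix} P & q \\ q^\top & r - 1 \end{pmatrix}.
$$

Hence we have $f(v) \ge \mathbf{1}_{\|v\|_2 \ge c_i}$. Taking expectation over both sides and noticing that the expectation of an indicator function is the probability, we establish the theorem. ■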

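Appendix B
Proof of Theorem 4

Theorem 4: Assume that the set

$$
\mathcal{Z} \triangleq \{z \in \mathbb{R}^m \mid f_j(z) \le 0, \; j = 1, \cdots, k; \; z \ge 0\}
$$

has non-empty relative interior. Then the robust regression problem

$$
\min_x \Big\{ \max_{\Delta A \in \mathcal{U}'} \|b - (A + \Delta A) x\|_a \Big\}
$$

is equivalent to the following regularized regression problem

$$
\min_{\lambda \in \mathbb{R}^k_+,\, \kappa \in \mathbb{R}^m_+,\, x} \Big\{ \|b - Ax\|_a + v(\lambda, \kappa, x) \Big\};
\quad \text{where: } v(\lambda, \kappa, x) \triangleq \max_{c \in \mathbb{R}^m} \Big[ (\kappa + |x|)^\top c - \sum_{j=1}^k \lambda_j f_j(c) \Big].
$$

Proof: Fix a solution $x^*$. Notice that

$$
\mathcal{U}' = \{(\delta_1, \cdots, \delta_m) \mid c \in \mathcal{Z}; \; \|\delta_i\|_a \le c_i, \; i = 1, \cdots, m\}.
$$

Hence we have:

$$
\begin{aligned}
\max_{\Delta A \in \mathcal{U}'} \|b - (A + \Delta A) x^*\|_a
&= \max_{c \in \mathcal{Z}} \Big\{ \max_{\|\delta_i\|_a \le c_i,\; i = 1, \cdots, m} \|b - (A + (\delta_1, \cdots, \delta_m)) x^*\|_a \Big\} \\
&= \max_{c \in \mathcal{Z}} \Big\{ \|b - A x^*\|_a + \sum_{i=1}^m c_i |x_i^*| \Big\} \qquad (13) \\
&= \|b - A x^*\|_a + \max_{c \in \mathcal{Z}} \{ |x^*|^\top c \}.
\end{aligned}
$$

The second equation follows from Theorem 3.

Now we need to evaluate $\max_{c \in \mathcal{Z}} \{|x^*|^\top c\}$, which equals $-\min_{c \in \mathcal{Z}} \{-|x^*|^\top c\}$. Hence we are minimizing a linear function over a set of convex constraints. Furthermore, by assumption Slater's condition holds. Hence the duality gap of $\min_{c \in \mathcal{Z}} \{-|x^*|^\top c\}$ is zero. A standard duality analysis shows that

$$
\max_{c \in \mathcal{Z}} \{ |x^*|^\top c \} = \min_{\lambda \in \mathbb{R}^k_+,\, \kappa \in \mathbb{R}^m_+} v(\lambda, \kappa, x^*). \qquad (14)
$$

We establish the theorem by substituting Equation (14) back into Equation (13) and taking the minimum over $x$ on both sides. ■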
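The inner maximization in Equation (13) can be illustrated numerically for a fixed bound vector $c$. The sketch below assumes the norm $\|\cdot\|_a$ is the $\ell_2$ norm, which is only one instance of the general statement; the function names are hypothetical. It builds a perturbation whose $i$-th column is $-c_i \,\mathrm{sign}(x_i)\, u$, with $u$ the unit vector along the residual $b - Ax$, and checks that the perturbed residual norm equals $\|b - Ax\|_2 + \sum_i c_i |x_i|$.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(t * t for t in v))

def residual(A, b, x):
    # b - A x, with A given as a list of rows.
    return [bi - sum(a * v for a, v in zip(row, x)) for row, bi in zip(A, b)]

def worst_case_perturbation(A, b, x, c):
    # Column i of the returned matrix is -c[i] * sign(x[i]) * u, where u is
    # the unit residual direction, so each column has l2 norm at most c[i].
    r = residual(A, b, x)
    nr = l2_norm(r)
    # If the residual is zero, any unit direction works; use the first axis.
    u = [t / nr for t in r] if nr > 0 else [1.0] + [0.0] * (len(b) - 1)
    sign = lambda t: (t > 0) - (t < 0)
    return [[-c[i] * sign(x[i]) * u[k] for i in range(len(x))]
            for k in range(len(b))]

def perturbed_residual_norm(A, b, x, delta):
    # ||b - (A + delta) x||_2
    perturbed = [[a + d for a, d in zip(ra, rd)] for ra, rd in zip(A, delta)]
    return l2_norm(residual(perturbed, b, x))
```

With this choice every column perturbation pushes the residual further along its own direction, which is why the worst-case loss decomposes into the nominal residual norm plus a weighted $\ell_1$ penalty.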

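Appendix C
Proof of Proposition 1

Proposition 1: Given a function $h: \mathbb{R}^{m+1} \to \mathbb{R}$ and Borel sets $Z_1, \cdots, Z_n \subseteq \mathbb{R}^{m+1}$, let

$$
\mathcal{P}_n \triangleq \Big\{ \mu \in \mathcal{P} \;\Big|\; \forall S \subseteq \{1, \cdots, n\}: \mu\Big(\bigcup_{i \in S} Z_i\Big) \ge |S| / n \Big\}.
$$

The following holds

$$
\frac{1}{n} \sum_{i=1}^n \sup_{(r_i, b_i) \in Z_i} h(r_i, b_i) = \sup_{\mu \in \mathcal{P}_n} \int_{\mathbb{R}^{m+1}} h(r, b) \, d\mu(r, b).
$$

Proof: To prove Proposition 1, we first establish the following lemma.

Lemma 2: Given a function $f: \mathbb{R}^{m+1} \to \mathbb{R}$, and a Borel set $Z \subseteq \mathbb{R}^{m+1}$, the following holds:

$$
\sup_{x' \in Z} f(x') = \sup_{\mu \in \mathcal{P} \mid \mu(Z) = 1} \int_{\mathbb{R}^{m+1}} f(x) \, d\mu(x).
$$

Proof: Let $\hat{x}$ be an $\epsilon$-optimal solution to the left-hand side, and consider the probability measure $\mu'$ that puts mass 1 on $\hat{x}$, which satisfies $\mu'(Z) = 1$. Hence, we have

$$
\sup_{x' \in Z} f(x') - \epsilon \le \sup_{\mu \in \mathcal{P} \mid \mu(Z) = 1} \int_{\mathbb{R}^{m+1}} f(x) \, d\mu(x).
$$

Since $\epsilon$ can be arbitrarily small, this leads to

$$
\sup_{x' \in Z} f(x') \le \sup_{\mu \in \mathcal{P} \mid \mu(Z) = 1} \int_{\mathbb{R}^{m+1}} f(x) \, d\mu(x). \qquad (15)
$$

Next construct the function $\hat{f}: \mathbb{R}^{m+1} \to \mathbb{R}$ as

$$
\hat{f}(x) \triangleq \begin{cases} f(\hat{x}) & x \in Z; \\ f(x) & \text{otherwise.} \end{cases}
$$

By definition of $\hat{x}$ we have $f(x) \le \hat{f}(x) + \epsilon$ for all $x \in \mathbb{R}^{m+1}$. Hence, for any probability measure $\mu$ such that $\mu(Z) = 1$, the following holds

$$
\int_{\mathbb{R}^{m+1}} f(x) \, d\mu(x) \le \int_{\mathbb{R}^{m+1}} \hat{f}(x) \, d\mu(x) + \epsilon = f(\hat{x}) + \epsilon \le \sup_{x' \in Z} f(x') + \epsilon.
$$

This leads to

$$
\sup_{\mu \in \mathcal{P} \mid \mu(Z) = 1} \int_{\mathbb{R}^{m+1}} f(x) \, d\mu(x) \le \sup_{x' \in Z} f(x') + \epsilon.
$$

Since $\epsilon$ can be arbitrarily small, we have

$$
\sup_{\mu \in \mathcal{P} \mid \mu(Z) = 1} \int_{\mathbb{R}^{m+1}} f(x) \, d\mu(x) \le \sup_{x' \in Z} f(x'). \qquad (16)
$$

Combining (15) and (16), we prove the lemma. ■

Now we proceed to prove the proposition. Let $\hat{x}_i$ be an $\epsilon$-optimal solution to $\sup_{x_i \in Z_i} f(x_i)$. Observe that the empirical distribution for $(\hat{x}_1, \cdots, \hat{x}_n)$ belongs to $\mathcal{P}_n$. Since $\epsilon$ can be arbitrarily close to zero, we have

$$
\frac{1}{n} \sum_{i=1}^n \sup_{x_i \in Z_i} f(x_i) \le \sup_{\mu \in \mathcal{P}_n} \int_{\mathbb{R}^{m+1}} f(x) \, d\mu(x). \qquad (17)
$$

Without loss of generality, assume

$$
f(\hat{x}_1) \le f(\hat{x}_2) \le \cdots \le f(\hat{x}_n). \qquad (18)
$$

Now construct the following function

$$
\hat{f}(x) \triangleq \begin{cases} \min_{i \mid x \in Z_i} f(\hat{x}_i) & x \in \bigcup_{j=1}^n Z_j; \\ f(x) & \text{otherwise.} \end{cases} \qquad (19)
$$

Observe that $f(x) \le \hat{f}(x) + \epsilon$ for all $x$.

Furthermore, given $\mu \in \mathcal{P}_n$, we have

$$
\int_{\mathbb{R}^{m+1}} f(x) \, d\mu(x) - \epsilon \le \int_{\mathbb{R}^{m+1}} \hat{f}(x) \, d\mu(x) = \sum_{k=1}^n f(\hat{x}_k) \Big[ \mu\Big(\bigcup_{i=1}^k Z_i\Big) - \mu\Big(\bigcup_{i=1}^{k-1} Z_i\Big) \Big].
$$

Denote $\alpha_k \triangleq \big[ \mu(\bigcup_{i=1}^k Z_i) - \mu(\bigcup_{i=1}^{k-1} Z_i) \big]$; we have

$$
\sum_{k=1}^n \alpha_k = 1, \qquad \sum_{k=1}^t \alpha_k \ge t / n.
$$

Hence by Equation (18) we have

$$
\sum_{k=1}^n \alpha_k f(\hat{x}_k) \le \frac{1}{n} \sum_{k=1}^n f(\hat{x}_k).
$$

Thus we have for any $\mu \in \mathcal{P}_n$,

$$
\int_{\mathbb{R}^{m+1}} f(x) \, d\mu(x) - \epsilon \le \frac{1}{n} \sum_{k=1}^n f(\hat{x}_k).
$$

Therefore,

$$
\sup_{\mu \in \mathcal{P}_n} \int_{\mathbb{R}^{m+1}} f(x) \, d\mu(x) - \epsilon \le \sup_{x_i \in Z_i} \frac{1}{n} \sum_{k=1}^n f(x_k).
$$

Since $\epsilon$ can be arbitrarily close to 0, we prove the proposition by combining this with (17). ■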
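Proof of Corollary 3

Corollary 3: Given $b \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times m}$, the following equation holds for any $x \in \mathbb{R}^m$,

$$
\|b - Ax\|_2 + \sqrt{n} c_n \|x\|_1 + \sqrt{n} c_n = \sup_{\mu \in \hat{\mathcal{P}}(n)} \sqrt{n \int_{\mathbb{R}^{m+1}} (b' - r'^\top x)^2 \, d\mu(r', b')}. \qquad (20)
$$

Here,

$$
\hat{\mathcal{P}}(n) \triangleq \bigcup_{\|\sigma\|_2 \le \sqrt{n} c_n;\; \forall i: \|\delta_i\|_2 \le \sqrt{n} c_n} \mathcal{P}_n(A, \Delta, b, \sigma);
$$

$$
\mathcal{P}_n(A, \Delta, b, \sigma) \triangleq \Big\{ \mu \in \mathcal{P} \;\Big|\; Z_i = [b_i - \sigma_i, b_i + \sigma_i] \times \prod_{j=1}^m [a_{ij} - \delta_{ij}, a_{ij} + \delta_{ij}]; \;\; \forall S \subseteq \{1, \cdots, n\}: \mu\Big(\bigcup_{i \in S} Z_i\Big) \ge |S| / n \Big\}.
$$

Proof: The right-hand side of Equation (20) equals

$$
\sup_{\|\sigma\|_2 \le \sqrt{n} c_n;\; \forall i: \|\delta_i\|_2 \le \sqrt{n} c_n} \Big\{ \sup_{\mu \in \mathcal{P}_n(A, \Delta, b, \sigma)} \sqrt{n \int_{\mathbb{R}^{m+1}} (b' - r'^\top x)^2 \, d\mu(r', b')} \Big\}.
$$

Notice that, by the equivalence to the robust formulation, the left-hand side equals

$$
\begin{aligned}
& \max_{\|\sigma\|_2 \le \sqrt{n} c_n;\; \forall i: \|\delta_i\|_2 \le \sqrt{n} c_n} \big\| b + \sigma - (A + [\delta_1, \cdots, \delta_m]) x \big\|_2 \\
=\; & \sup_{\|\sigma\|_2 \le \sqrt{n} c_n;\; \forall i: \|\delta_i\|_2 \le \sqrt{n} c_n} \;\; \sup_{(\hat{b}_i, \hat{r}_i) \in [b_i - \sigma_i, b_i + \sigma_i] \times \prod_{j=1}^m [a_{ij} - \delta_{ij}, a_{ij} + \delta_{ij}]} \sqrt{\sum_{i=1}^n (\hat{b}_i - \hat{r}_i^\top x)^2} \\
=\; & \sup_{\|\sigma\|_2 \le \sqrt{n} c_n;\; \forall i: \|\delta_i\|_2 \le \sqrt{n} c_n} \sqrt{\sum_{i=1}^n \; \sup_{(\hat{b}_i, \hat{r}_i) \in [b_i - \sigma_i, b_i + \sigma_i] \times \prod_{j=1}^m [a_{ij} - \delta_{ij}, a_{ij} + \delta_{ij}]} (\hat{b}_i - \hat{r}_i^\top x)^2}.
\end{aligned}
$$

Furthermore, applying Proposition 1 yields

$$
\begin{aligned}
& \sqrt{\sum_{i=1}^n \; \sup_{(\hat{b}_i, \hat{r}_i) \in [b_i - \sigma_i, b_i + \sigma_i] \times \prod_{j=1}^m [a_{ij} - \delta_{ij}, a_{ij} + \delta_{ij}]} (\hat{b}_i - \hat{r}_i^\top x)^2} \\
=\; & \sqrt{\sup_{\mu \in \mathcal{P}_n(A, \Delta, b, \sigma)} n \int_{\mathbb{R}^{m+1}} (b' - r'^\top x)^2 \, d\mu(r', b')} \\
=\; & \sup_{\mu \in \mathcal{P}_n(A, \Delta, b, \sigma)} \sqrt{n \int_{\mathbb{R}^{m+1}} (b' - r'^\top x)^2 \, d\mu(r', b')},
\end{aligned}
$$

which proves the corollary. ■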